Part 2. Summarizing distributions (part b)
Suppose we have two RVs \(X\) and \(Y\).
We know the joint PMF/PDF \(f(x, y)\) and joint CDF \(F(x, y)\).
How can we summarize the relationship between \(X\) and \(Y\)?
\[\text{Cov}[X, Y] = E\left[ (X - E[X])(Y - E[Y]) \right]\]
Intuitively, “Does \(X\) tend to be above \(E[X]\) when \(Y\) is above \(E[Y]\)? (And by how much?)”
\[ f(x,y) = \begin{cases} 1/3 & x = 0, y = 0 \\ 1/6 & x = 1, y = 0 \\ 1/2 & x = 1, y = 1 \\ 0 & \text{otherwise} \end{cases} \]
What is \(E[X]\)? What is \(E[Y]\)?
Then compute expectation of \((X - E[X])(Y - E[Y])\) (function of two RVs) as above.
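As a quick check, here is a minimal Python sketch that computes these quantities directly from the PMF above (representing the PMF as a dict is my own choice, not from the text):

```python
# Joint PMF from the example above, stored as {(x, y): probability}.
pmf = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

EX = sum(x * p for (x, y), p in pmf.items())  # E[X] = 2/3
EY = sum(y * p for (x, y), p in pmf.items())  # E[Y] = 1/2

# Cov[X, Y] = E[(X - E[X])(Y - E[Y])], a sum over the support
cov = sum((x - EX) * (y - EY) * p for (x, y), p in pmf.items())
print(EX, EY, cov)  # 0.666..., 0.5, 0.1666... (i.e. Cov[X, Y] = 1/6)
```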
Compare:
\[\begin{align}\text{Cov}[X, Y] &= E\left[ \color{blue}{(X - E[X])}\color{orange}{(Y - E[Y])} \right] \\ \text{V}[X] &= E\left[ \color{blue}{(X - E[X])}\color{blue}{(X - E[X])} \right]\end{align}\]
Plot the points in \(\text{Supp}[X, Y]\) on two axes with point size proportional to \(f(x, y)\).
Divide the \(x, y\) plane into quadrants defined by \(x = E[X]\) and \(y = E[Y]\).
For each point \((x, y) \in \text{Supp}[X, Y]\), create a rectangle with \((x,y)\) at one corner and \((E[X], E[Y])\) at the opposite corner.
Shade the rectangle green in quadrants I and III (where \((x - E[X])(y - E[Y]) > 0\)), otherwise red, with intensity proportional to \(f(x,y)\).
Covariance (roughly) measures how much green vs red there is.
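A possible matplotlib rendering of this picture, using the example PMF from before (the styling choices are illustrative, not from the text):

```python
# Sketch: quadrants, rectangles, and point sizes proportional to f(x, y).
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

pmf = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}
EX = sum(x * p for (x, y), p in pmf.items())
EY = sum(y * p for (x, y), p in pmf.items())

fig, ax = plt.subplots()
ax.axvline(EX, ls="--")  # quadrant boundary x = E[X]
ax.axhline(EY, ls="--")  # quadrant boundary y = E[Y]
for (x, y), p in pmf.items():
    color = "green" if (x - EX) * (y - EY) > 0 else "red"
    # Rectangle with (x, y) and (E[X], E[Y]) at opposite corners,
    # shading intensity proportional to f(x, y).
    ax.add_patch(Rectangle((min(x, EX), min(y, EY)),
                           abs(x - EX), abs(y - EY),
                           color=color, alpha=p))
    ax.scatter(x, y, s=300 * p, color="black", zorder=3)
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
plt.show()
```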
First formulation:
\[\text{Cov}[X, Y] = E\left[ (X - E[X])(Y - E[Y]) \right]\]
As with variance, an alternative formulation:
\[\text{Cov}[X, Y] = E\left[XY\right] - E[X]E[Y]\]
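To see why, expand the product inside the first formulation and apply linearity of expectations, treating \(E[X]\) and \(E[Y]\) as constants:
\[\begin{aligned} E[(X - E[X])(Y - E[Y])] &= E[XY - X E[Y] - E[X] Y + E[X]E[Y]] \\ &= E[XY] - E[X]E[Y] - E[X]E[Y] + E[X]E[Y] \\ &= E[XY] - E[X]E[Y] \end{aligned}\]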
Note:
If \(f\) is a linear function or linear operator, then \(f(x + y) = f(x) + f(y)\). (Additivity property.)
Recall linearity of expectations: \(E[X + Y] = E[X] + E[Y]\).
But in general \(\text{Var}[X + Y] \neq \text{Var}[X] + \text{Var}[Y]\).
Why not?
\[\begin{aligned} \text{Var}(X+Y) &= E[(X + Y - E[X + Y])^2] \\ &= E[(X - E[X] + Y - E[Y])^2] \\ &= E[(\tilde{X} + \tilde{Y})^2] \\ &= E[\tilde{X}^2 + \tilde{Y}^2 + 2 \tilde{X} \tilde{Y}] \\ &= E[\tilde{X}^2] + E[\tilde{Y}^2] + E[2 \tilde{X} \tilde{Y}] \\ &= E[(X - E[X])^2] + E[(Y - E[Y])^2] + 2E[(X - E[X])(Y - E[Y])] \\ &= \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y) \end{aligned}\]
(Here \(\tilde{X} = X - E[X]\) and \(\tilde{Y} = Y - E[Y]\) denote the centered variables.)
The correlation of two RVs \(X\) and \(Y\) with \(\sigma[X] > 0\) and \(\sigma[Y] > 0\) is
\[ \rho[X, Y] = \frac{\text{Cov}[X, Y]}{\sigma[X] \sigma[Y]}\]
Correlation is scale-invariant: \(\rho[X, Y] = \rho[aX, bY]\) for \(a, b > 0\)
Prove it!
\[\begin{align} \text{Cov}[aX, bY] &= E[aX bY] - E[aX]E[bY] \\ &= ab E[XY] - ab E[X]E[Y] \\ &= ab (E[XY] - E[X]E[Y]) \\ &= ab \text{Cov}[X, Y] \end{align}\]
\[\sigma[aX] = \sqrt{\text{V}[aX]} = \sqrt{a^2 \text{V}[X]} = a \sigma[X] \qquad (\text{using } a > 0)\]
By same argument, \(\sigma[bY] = b\sigma[Y]\).
So
\[\begin{align} \rho[aX, bY] &= \frac{\text{Cov}[aX, bY]}{\sigma[aX] \sigma[bY]} \\ &= \frac{ab \text{Cov}[X, Y]}{a \sigma[X] b \sigma[Y]} = \frac{\text{Cov}[X, Y]}{\sigma[X] \sigma[Y]} \\ &= \rho[X, Y] \end{align}\]
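A quick numerical sanity check of this invariance (simulated data; the constants \(a = 2\) and \(b = 5\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 0.5 * x + rng.normal(size=10_000)  # correlated with x by construction

a, b = 2.0, 5.0  # arbitrary positive constants
rho_xy = np.corrcoef(x, y)[0, 1]
rho_ab = np.corrcoef(a * x, b * y)[0, 1]
print(rho_xy, rho_ab)  # identical up to floating-point error
```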
We spent time on expectations:
\[E[Y] = \sum_y y f_Y(y).\]
Also on conditional distributions:
\[f_{Y|X}(y|x) = \frac{f(x, y)}{f_X(x)}\]
Combining the two ideas, we get conditional expectations:
\[E[Y \mid X = x] = \sum_y y f_{Y|X}(y \mid x).\]
i.e. the expectation of \(Y\) at some \(x\).
(Figure: red line represents \(E[Y | X = x]\); dots are a sample from \(f(x, y)\).)
Two formulations:
\[V[Y | X = x] = E[(Y - E[Y | X = x])^2 | X = x]\]
\[V[Y | X = x] = E[Y^2 | X = x] - E[Y | X = x]^2\]
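Continuing the running example, here is a sketch that computes \(E[Y \mid X = x]\) and \(V[Y \mid X = x]\) from the joint PMF (the helper name is my own):

```python
pmf = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

def cond_moments(x0):
    """Return (E[Y | X = x0], V[Y | X = x0]) from the joint PMF."""
    fx = sum(p for (x, y), p in pmf.items() if x == x0)  # marginal f_X(x0)
    ey = sum(y * p / fx for (x, y), p in pmf.items() if x == x0)
    ey2 = sum(y**2 * p / fx for (x, y), p in pmf.items() if x == x0)
    return ey, ey2 - ey**2  # second formulation: E[Y^2|X=x] - E[Y|X=x]^2

print(cond_moments(0))  # (0.0, 0.0)
print(cond_moments(1))  # (0.75, 0.1875), i.e. 3/4 and 3/16
```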
Conditional expectation \(E[Y | X = x]\) is for a specific \(x\).
Conditional expectation function (CEF) \(E[Y | X]\) is for all \(x\).
The CEF \(E[Y | X]\) is the expectation of \(Y\) at each \(X\).
We already established that the expectation/mean is the best (in the MSE sense) predictor.
So CEF is the best possible way to use \(X\) to predict \(Y\). (See Theorem 2.2.20.)
Multivariate generalization: \(E[Y \mid X_1, X_2, X_3, \ldots, X_n]\) is the best way to use \(X_1, \ldots, X_n\) to predict \(Y\).
For random variables \(X\) and \(Y\), the law of iterated expectations (LIE) states
\[E[Y] = E[E[Y | X]]\]
This means there are two ways to get \(E[Y]\): directly from the marginal distribution of \(Y\), or indirectly through the conditional expectations.
In words: an unconditional average (\(E[Y]\)) can be represented as a weighted average of conditional expectations (\(E[Y \mid X]\)) with weights taken from the distribution of the variable conditioned on, i.e. \(X\).
Why would you want to do that?
A population is 80% female and 20% male.
The average age among females (\(E[Y | X = 1]\)) is 25. The average age among males (\(E[Y | X = 0]\)) is 20.
What is the average age in the population \(E[Y]\)?
\[E[E[Y | X]] = .8 \times 25 + .2 \times 20 = 24\]
See homework for another example.
Suppose we want to measure the average effect of participating in a program (e.g. job training, voter education, military mobilization).
Call \(Y\) the (unobservable) effect of the treatment. We want the average treatment effect (ATE), \(E[Y]\).
Suppose that comparing participants and non-participants gives us a good estimate of the average treatment effect only within subgroups defined by age (\(X\)).
So we have \(E[Y \mid X]\).
Now we just combine these estimates (by LIE): \(E[Y] = E[E[Y \mid X]] = \sum_{x} E[Y \mid X = x] f(x)\)
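A sketch of this combination step in Python, with hypothetical subgroup estimates (the age groups, weights, and effect sizes below are made up for illustration):

```python
# Hypothetical subgroup estimates E[Y | X = x] and weights f(x).
subgroup_ate = {"18-34": 2.0, "35-54": 1.5, "55+": 0.5}  # E[Y | X = x]
f_x = {"18-34": 0.40, "35-54": 0.35, "55+": 0.25}        # distribution of X

# LIE: E[Y] = sum over x of E[Y | X = x] * f(x)
ate = sum(subgroup_ate[x] * f_x[x] for x in subgroup_ate)
print(ate)  # 0.8 + 0.525 + 0.125 = 1.45
```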
\[V[Y] = E[V[Y|X]] + V[E[Y|X]]\]
In words, the variance of \(Y\) can be decomposed into the expected conditional variance (\(E[V[Y|X]]\)) and the variance of the conditional expectation (\(V[E[Y|X]]\)).
Sometimes called “Ev(v)e’s law” because
\[V[Y] = \color{red}{E}[\color{red}{V}[Y|X]] + \color{red}{V}[\color{red}{E}[Y|X]]\]
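We can verify Eve's law numerically on the running example PMF, reusing the conditional moments computed earlier:

```python
pmf = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}
f_x = {0: 1/3, 1: 2/3}                  # marginal distribution of X
cond = {0: (0.0, 0.0), 1: (3/4, 3/16)}  # (E[Y|X=x], V[Y|X=x]) from before

EY = sum(e * f_x[x] for x, (e, v) in cond.items())          # E[E[Y|X]] = 1/2
E_cond_var = sum(v * f_x[x] for x, (e, v) in cond.items())  # E[V[Y|X]] = 1/8
V_cond_exp = sum((e - EY) ** 2 * f_x[x]
                 for x, (e, v) in cond.items())             # V[E[Y|X]] = 1/8

V_Y = sum((y - EY) ** 2 * p for (x, y), p in pmf.items())   # V[Y] directly
print(V_Y, E_cond_var + V_cond_exp)  # both equal 0.25
```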
Suppose we want to predict \(Y\) using \(X\), and we focus on a linear predictor, i.e. a function of the form \(\alpha + \beta X\).
The best (minimum MSE) predictor satisfies
\[(\alpha, \beta) = \underset{(a,b) \in \mathbb{R}^2}{\arg\min} \, \mathrm{E}\,[\left(Y - (a + bX)\right)^2]\]
The solution (see Theorem 2.2.21) is
\[\beta = \frac{\text{Cov}[X, Y]}{\text{V}[X]}, \qquad \alpha = E[Y] - \beta E[X].\]
So we could obtain the BLP from a joint PMF. (See homework.)
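As a sketch, here are those coefficients computed for the running example PMF (pure Python; only the PMF itself comes from the text):

```python
pmf = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

EX = sum(x * p for (x, y), p in pmf.items())                     # 2/3
EY = sum(y * p for (x, y), p in pmf.items())                     # 1/2
cov = sum((x - EX) * (y - EY) * p for (x, y), p in pmf.items())  # 1/6
var_x = sum((x - EX) ** 2 * p for (x, y), p in pmf.items())      # 2/9

beta = cov / var_x       # 3/4
alpha = EY - beta * EX   # 0
print(alpha, beta)       # BLP: 0 + 0.75 * X
```

Here the BLP coincides with the CEF, since \(E[Y \mid X = 0] = 0\) and \(E[Y \mid X = 1] = 3/4\) happen to lie on a line.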
Above, we were looking for the best linear predictor (BLP) of \(Y\) as a function of \(X\):
\[(\alpha, \beta) = \underset{(a,b) \in \mathbb{R}^2}{\arg\min} \, \mathrm{E}\,[\left(Y - (a + bX)\right)^2]\]
Same answer if you look for the best linear predictor of the CEF \(E[Y | X]\):
\[(\alpha, \beta) = \underset{(a,b) \in \mathbb{R}^2}{\arg\min} \, \mathrm{E}\,[\left(\mathrm{E}[Y|X] - (a + bX)\right)^2]\]